Assessing Essential Dimensionality of Real Data


Abstract 


The  purpose  of  this  article  is  to  validate  the  capability  of  DIMTEST  to  assess 
essential  dimensionality  of  the  model  underlying  the  item  responses  of  real  tests  as  opposed 
to simulated tests. A variety of real test data from different sources are used to assess
essential  dimensionality.  Based  on  DIMTEST  results,  some  test  data  are  assessed  as  fitting 
an  essential  unidimensional  model  while  others  are  not.  Essential  unidimensional  test  data, 
as  assessed  by  DIMTEST,  are  then  combined  to  form  two-dimensional  test  data.  The 
power  of  Stout's  statistic  T  is  examined  for  these  two-dimensional  data.  It  is  shown  that 
the  results  of  DIMTEST  on  real  tests  replicate  findings  from  simulated  tests  in  that  the 
statistic  T  discriminates  well  between  essential  unidimensional  and  multidimensional  tests. 
It  is  also  highly  sensitive  to  major  abilities  while  being  insensitive  to  relatively  minor 
abilities  influencing  item  responses. 

Subject  terms:  DIMTEST,  essential  independence,  essential  dimensionality, 
unidimensionality,  multidimensionality,  item  response  theory. 
Most of the currently used item response theory (IRT) models require the assumption
of  unidimensionality.  From  the  strict  IRT  perspective,  unidimensionality  refers  to  one,  and 
only  one,  trait  underlying  test  items.  Yet,  it  is  a  well  known  fact  that  items  are  multiply 
determined  (Humphreys,  1981, 1985, 1986;  Hambleton  &  Swaminathan,  1985,  chap.  2; 
Reckase,  1979,  1985;  Stout,  1987;  Traub,  1983).  Hence  from  the  substantive  viewpoint,  the 
assumption  of  unidimensionality  requires  that  the  test  items  measure  one  dominant  trait. 
Stout  (1987)  coined  the  term  essential  unidimensionality  to  refer  to  a  particular 
mathematical  formulation  of  a  test  having  exactly  one  dominant  trait.  Dimensionality  is, 
however,  determined  by  the  joint  influence  of  test  items  and  examinees  taking  the  test 
(Reckase,  1990).  In  addition,  extraneous  factors  such  as  teaching  methods,  anxiety  level  of 
examinees,  etc.,  may  also  influence  the  dimensionality  of  the  given  item  response  data. 

Thus  dimensionality  has  to  be  assessed  each  time  a  test  is  administered  to  a  new  group  of 
examinees. 

Factor  analysis  has  traditionally  been  the  most  popular  approach  to  assess 
dimensionality (Hambleton & Traub, 1973; Lumsden, 1961). Factor analysis, despite its
serious limitations in analyzing dichotomous data (for example, see Hulin, Drasgow, and
Parsons, 1983, chap. 8), has been the popular method to study the robustness of the
unidimensionality assumption (Drasgow & Parsons, 1983; Harrison, 1986; Reckase, 1979).
There are a number of other promising methods proposed and used in varying degrees to
assess dimensionality. To name a few: full information factor analysis based on the
principle of marginal maximum likelihood (Bock, Gibbons, & Muraki, 1985; TESTFACT:
Wilson, Wood, & Gibbons, 1983); nonlinear factor analysis (McDonald, 1962; McDonald &
Ahlawat, 1974; Jamshid & McDonald, 1983); Holland and Rosenbaum's (1986) test of
unidimensionality, monotonicity, and conditional independence based on contingency
tables; Tucker and Humphreys' methods based on the principle of local independence and
second factor loadings (Roznowski, Tucker, & Humphreys, 1991); and Stout's (1987)
statistical procedure based on essential independence and essential dimensionality. Hattie
(1984,  1985)  has  provided  a  comprehensive  review  of  traditional  approaches  to  assess 
dimensionality,  and  Zwick  (1987)  has  applied  some  of  the  above  mentioned  recent 
procedures  to  assess  dimensionality  of  National  Assessment  of  Educational  Progress  data. 
Despite  having  several  procedures  available  to  assess  dimensionality,  there  is  no  widespread 
consensus  among  substantive  researchers  for  a  preference  for  any  method(s),  and  often 
there  is  dissatisfaction  about  assessing  dimensionality  (Berger  &  Knol,  1990;  Hambleton  & 
Rovinelli,  1986;  Hattie,  1985). 

Stout  (1987)  proposed  a  statistical  test  (DIMTEST)  to  assess  essential 
unidimensionality  of  the  latent  space  underlying  a  set  of  items.  Nandakumar  (1987)  and 
Nandakumar  and  Stout  (in  press)  have  further  modified,  refined,  and  validated  DIMTEST 
for  assessing  essential  dimensionality  on  a  variety  of  simulated  tests.  This  article 
demonstrates  the  validity  and  usefulness  of  Stout's  procedure  on  a  variety  of  real,  as 
opposed  to  simulated,  tests.  Test  data  from  different  sources  are  collected  and  used  to 
assess  essential  unidimensionality.  Essential  unidimensional  data  are  then  combined  to 
form  two-dimensional  data.  The  power  of  Stout's  statistic  T  is  examined  for  these 
two-dimensional  data. 

DIMTEST  for  Assessing  Essential  Unidimensionality 

DIMTEST,  a  statistical  test  for  assessing  unidimensionality,  is  based  on  the  theory  of 
essential  dimensionality  and  essential  independence  (Stout,  1987, 1990).  An  item  pool  is 
said to be essentially independent with respect to the latent trait vector $\Theta$ if, for a given
initial segment of the item pool, the average absolute conditional (on $\Theta$) covariance of
item pairs approaches zero as the length of the segment increases. When only one dominant
ability $\theta$ meets the essential independence assumption, the item pool is said to be
essentially unidimensional. In contrast, the assumption of local independence requires the
conditional  covariances  to  be  zero  for  all  item  pairs  in  question.  The  number  of  abilities 
required  to  satisfy  the  local  independence  assumption  is  the  dimensionality  of  the  test. 
While the traditional definition of dimensionality (Lord & Novick, 1968) counts all abilities
required  to  respond  to  test  items  correctly  to  satisfy  the  assumption  of  local  independence, 
essential  dimensionality  counts  only  dominant  abilities  required  to  satisfy  the  assumption 
of  essential  independence  (as  opposed  to  local  independence).  DIMTEST,  using  this 
definition,  assesses  the  closeness  of  approximation  of  the  model  generating  the  given  item 
responses  to  the  essential  unidimensional  model.  Nandakumar  (1991)  describes  the 
theoretical  differences  between  traditional  dimensionality  and  essential  dimensionality  and 
establishes  through  Monte  Carlo  studies  the  usefulness  of  DIMTEST  for  assessing  essential 
unidimensionality  in  the  possible  presence  of  several  secondary  dimensions. 
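
The contrast between the two independence assumptions can be written compactly. The display below is a sketch of the verbal definitions given above; the symbol $U_i$ for the scored (0/1) response to item $i$ and $N$ for the length of the initial segment are notational assumptions introduced here for illustration.

```latex
% Essential independence: the average absolute conditional covariance vanishes
% as the initial segment of the item pool grows.
\[
\frac{1}{\binom{N}{2}} \sum_{1 \le i < j \le N}
  \left| \operatorname{Cov}\left(U_i, U_j \mid \Theta = \theta\right) \right|
  \longrightarrow 0 \quad \text{as } N \to \infty, \text{ for each } \theta .
\]
% Local independence (as characterized in the text): every conditional covariance is zero.
\[
\operatorname{Cov}\left(U_i, U_j \mid \Theta = \theta\right) = 0
  \quad \text{for all item pairs } i \ne j \text{ and each } \theta .
\]
```

The first condition tolerates small nonzero conditional covariances produced by minor abilities, which is why essential dimensionality counts only the dominant abilities.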

To use DIMTEST for assessing essential unidimensionality, it is assumed that a
group of J examinees take an N-item test. Each examinee produces a vector of responses of
1s and 0s, with 1 denoting a correct response and 0 denoting an incorrect response. It is
assumed that essential independence with respect to some dominant ability $\Theta$ holds and
that the item response functions are monotonic with respect to the same $\Theta$. The
hypothesis is stated as follows:


\[ H_0: d_E = 1 \quad \text{versus} \quad H_1: d_E > 1, \]

where $d_E$ denotes the essential dimensionality of the latent space underlying a set of items.

In order to assess essential unidimensionality of a given set of test data, DIMTEST follows
several steps. The steps are summarized briefly here (for details see Stout, 1987;
Nandakumar & Stout, in press). First, test items are split into three subtests, AT1, AT2,
and PT, with the aid of factor analysis (FA) using part of the sample (a sample size of 500
is recommended for this purpose). Items of AT1 are selected so that they all tap the same
dominant ability. Instead of using FA, it is also possible to use expert opinion (EO) to
select items for AT1. If the FA method of selection is chosen, DIMTEST automatically
determines the length of the subtest AT1. Once items for AT1 are chosen, items of AT2
are selected so that they have a difficulty distribution similar to that of the AT1 items (for
details see Stout, 1987). The remaining items form the partitioning subtest PT.

Second, examinees are assigned to K different subgroups based on their score on the
partitioning subtest PT. In other words, all examinees obtaining the same PT total score
are assigned to the same subgroup. When the subtest PT is "long" and the test is
essentially unidimensional, examinees within each subgroup are assumed to be
of approximately similar ability. When PT is not long, the subtest AT2 compensates for
the bias in AT1 caused by PT being short. AT2 also compensates for the bias in AT1
caused by the presence of guessing or by the difficulty factor that is often found by factor
analysis.
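
The subgrouping in this second step can be illustrated with a minimal sketch. It assumes the responses are stored as one 0/1 vector per examinee and that the column indices of the partitioning subtest are already known; the function name `partition_by_pt_score` and the toy data are hypothetical and are not part of DIMTEST itself.

```python
from collections import defaultdict


def partition_by_pt_score(responses, pt_items):
    """Group examinee indices by number-correct score on the partitioning subtest.

    responses: list of 0/1 response vectors, one per examinee.
    pt_items:  column indices of the partitioning-subtest items (assumed given).
    """
    subgroups = defaultdict(list)
    for examinee, vector in enumerate(responses):
        pt_score = sum(vector[i] for i in pt_items)
        subgroups[pt_score].append(examinee)
    return dict(subgroups)


# Toy data: 4 examinees, 5 items; columns 2-4 play the role of the partitioning subtest.
toy = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
groups = partition_by_pt_score(toy, pt_items=[2, 3, 4])
```

Each key of `groups` is a partitioning-subtest score and each value lists the examinees sharing that score, so the number of keys corresponds to K.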

Third, within each subgroup k, variance estimates, $\hat{\sigma}_k^2$ and $\hat{\sigma}_{U,k}^2$, and the standard
error of estimate $S_k$ are computed using item responses of AT1. These estimates are then
summed across K subgroups to obtain

\[ T_L = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \frac{\hat{\sigma}_k^2 - \hat{\sigma}_{U,k}^2}{S_k}. \]

Similarly, $T_B$ is computed using items of subtest AT2. Stout's statistic T is given by

\[ T = \frac{T_L - T_B}{\sqrt{2}}. \]

The decision rule is to reject $H_0$ if $T > Z_\alpha$, where $Z_\alpha$ is the upper $100(1-\alpha)$ percentile of the
standard normal distribution, $\alpha$ being the desired level of significance.
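
The third step and the decision rule can be sketched as follows. The sketch assumes that the two variance estimates and the standard error for each subgroup have already been computed (their formulas are given in Stout, 1987, and are not reproduced here), and that each subtest statistic is the sum over subgroups of the standardized difference between the two variance estimates, divided by the square root of K. All function and argument names are illustrative assumptions.

```python
import math


def subtest_statistic(var_est, var_unidim_est, std_err):
    """Sum of standardized variance differences over K subgroups, scaled by sqrt(K)."""
    k = len(var_est)
    total = sum((v - u) / s for v, u, s in zip(var_est, var_unidim_est, std_err))
    return total / math.sqrt(k)


def stout_t(t_l, t_b):
    """Combine the AT1 statistic (t_l) and the AT2 statistic (t_b) into T."""
    return (t_l - t_b) / math.sqrt(2)


def upper_tail_p(t):
    """P(Z > t) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def reject_h0(t, alpha=0.05):
    """One-sided rule: rejecting when T exceeds the upper percentile is the same as p < alpha."""
    return upper_tail_p(t) < alpha


# Made-up per-subgroup inputs for K = 2 subgroups (not data from this study).
t_l = subtest_statistic([0.30, 0.50], [0.10, 0.10], [0.10, 0.20])
t_value = stout_t(t_l, t_b=0.0)
p_value = upper_tail_p(t_value)
```

Because the test is one-sided, only large positive values of T count as evidence against essential unidimensionality.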

When the given test data are well modeled by an essential unidimensional model,
items of AT1, AT2, and PT would all be tapping the same dominant dimension. Therefore,
the variance estimates $\hat{\sigma}_k^2$ and $\hat{\sigma}_{U,k}^2$ will be approximately equal, resulting in a "small"
T-value, suggesting the tenability of $H_0$. On the other hand, when the test data are not well
modeled by an essential unidimensional model, the variance estimate $\hat{\sigma}_k^2$ will be much
larger than $\hat{\sigma}_{U,k}^2$, resulting in a "large" T-value leading to the rejection of $H_0$.

Simulation studies (Stout, 1987; Nandakumar, 1987; Nandakumar & Stout, in press)
on  a  wide  variety  of  tests  have  demonstrated  the  utility  of  DIMTEST  in  discriminating 
between  one-  and  two-dimensional  tests.  Simulation  studies  by  Nandakumar  (1991)  have 
particularly  demonstrated  the  usefulness  of  DIMTEST  in  assessing  essential 
unidimensionality with the aid of a rough index of deviation from essential
unidimensionality.  The  tests  in  Nandakumar  (1991)  were  modeled  by  two-  and 
higher-dimensional  IRT  models  as  opposed  to  a  one-dimensional  model,  and  the  test  items 
were  influenced  by  major  and  secondary  abilities  to  varying  degrees.  For  some  tests,  the 
secondary  ability  or  abilities  influenced  a  high  proportion  of  items,  and  for  others  the 
secondary  ability  or  abilities  influenced  only  a  small  proportion  of  items.  It  has  been  shown 
that DIMTEST reliably accepts the hypothesis of essential unidimensionality, provided the
model generating the test is close to the essential unidimensional model. This closeness holds when
each of the secondary abilities influences relatively few items or, if secondary abilities
influence many items, when the degree of influence on each item is small. The type-I error in
these cases was within tolerance of the nominal level. As the degree of influence of the
secondary  abilities  increases,  however,  the  approximation  to  an  essential  unidimensional 
model  degenerates,  inflating  the  observed  type-I  error  of  the  hypothesis  of  essential 
unidimensionality.  Simulation  results  (Stout,  1987;  Nandakumar  and  Stout,  in  press)  have 
particularly demonstrated the excellent power of the statistic T when the model generating
the item responses is two-dimensional (two major abilities) with correlation between
abilities  as  high  as  .7  and  items  jointly  influenced  by  both  abilities. 

Description  of  Data 

The  data  sets  used  in  the  present  study  came  from  different  sources.  The  U.S.  history 
and literature data for grade 11/age 17, from the 1986 National Assessment of Educational
Progress (NAEP, 1988) test data, were obtained from Educational Testing Service (ETS).
The General Science data, Arithmetic Reasoning data, and Auto and Shop Information data for
grades 10 and 12, from the Armed Services Vocational Aptitude Battery (ASVAB)
test data, were obtained from Linn, Hastings, Hu, and Ryan (1987). The Mathematics
Usage test data, the science test data, and the reading test data were obtained from the
American College Testing Program (ACT).

The NAEP achievement tests are part of the so-called Balanced Incomplete Block
(BIB) design with spiraled administration (Rogers et al., 1988), which allows the study of
interrelationships among all items within a subject area. Because the U.S. history and
literature tests fall into the simplest category of BIB design, it was relatively easy to
gather  the  response  data  for  all  examinees  taking  these  tests.  Hence,  these  tests  were 
chosen  for  the  present  study.  The  items  in  each  area  (history  and  literature)  were  divided 
into  four  "parallel"  blocks  with  approximately  the  same  number  of  items.  One  block  of 
items  out  of  four  was  randomly  selected  in  each  case  for  the  present  study. 

The  U.S.  history  test  data  (HIST-A)  with  36  items  consists  of  items  requiring 
knowledge  from  different  time  periods  of  U.S.  history:  Colonization  to  1763;  the 
Revolutionary  War  and  the  New  Republic,  1763-1815;  Civil  War,  1815-1877;  the  rise  of 
modern America and World War I, 1877-1920; the Depression and World War II, 1920-1945;
post-World War II, 1945 to the present; and map items requiring knowledge of the
geographical location of different countries in the world. A 31-item subtest of HIST-A,
named HIST, was created (explained in detail in the next section) consisting of all the items
of HIST-A except the five map items. There are 2428 examinees in the HIST-A and HIST
samples. 

The  literature  test  data  (LIT)  with  30  items  consists  of  items  requiring  knowledge 
within four literary genres: novels, short stories, and plays; myths, epics, and Biblical
characters and stories; poetry; and nonfiction. There are 2439 examinees in the LIT sample.

The  ASVAB  tests  are  used  by  the  Department  of  Defense  Student  Testing  Program 
in high schools and postsecondary schools. The Arithmetic Reasoning test data for grades
10 and 12, with 30 items each, consists of items requiring knowledge in solving arithmetic
word problems. The arithmetic reasoning test sample for grade 10 (AR10) has 1984
examinees, and for grade 12 (AR12) has 1961 examinees. The Auto and Shop Information
test data for grades 10 and 12, with 25 items each, consists of items requiring knowledge of
automobile, tools, and shop terminology and practices. The auto shop test sample for grade
10 (AS10) has 1981 examinees, and for grade 12 (AS12) has 1974 examinees. The General
Science test data for grades 10 and 12, with 25 items each, consists of items requiring
knowledge of high school level physical, life, and earth sciences. There are 1990
examinees in the general science test sample for grade 10 (GS10) and 1990 examinees in the
general science grade 12 (GS12) sample.

The  ACT  mathematics  usage  test  data  (MATH)  with  40  items  consists  of  items 
requiring  knowledge  in  solving  different  types  of  mathematics  problems:  arithmetic  and 
algebra  operations,  geometry,  numeration,  story  problems,  and  advanced  topics.  There  are 
2491  examinees  in  the  MATH  sample. 

The  ACT  reading  test  data  (READ-A)  with  40  items  consists  of  4  passages,  each 
followed  by  10  questions.  The  first  three  passages  are  taken  from  different  books  all  dealing 
with  humanities,  and  the  last  passage  is  taken  from  a  book  about  psychology.  The  first 
passage came from Of the Farm by John Updike. The second passage came from Light and
Color in Nature and Art by Samuel Williamson and Herman Cummins. The third passage
came  from  Theatre:  the  Dynamics  of  the  Art  by  Brian  Hansen.  And  the  fourth  passage 
came  from  Toward  a  Psychology  of  Being  by  Abraham  Maslow.  A  30-item  subset  of 
READ-A  named  READ  was  created  (details  in  the  next  section)  consisting  of  the  first  30 
items  of  READ-A.  There  are  5000  examinees  in  the  READ-A  and  READ  samples. 

The ACT science test data (SCI-A) with 40 items consists of 7 passages, each
followed by 5 to 7 questions. The first passage dealt with the effect of the thymus gland on
the development of the immune system in mice. The second passage dealt with sub-surface
ground  water  movement  and  its  effects  for  waste  disposal.  The  third  passage  dealt  with  the 
periods  of  the  pendulum  on  the  earth  and  the  moon  and  its  relationship  to  the  string  length 
and  mass  of  the  ball.  The  fourth  passage  dealt  with  the  environmental  impact  of  effluent. 
The  fifth  passage  dealt  with  a  bimetallic  catalyst  and  its  relationship  to  the  speed  of 
certain  chemical  reactions.  The  sixth  passage  dealt  with  the  views  of  two  paleontologists  on 
the characteristics of dinosaurs. And the seventh passage dealt with the principles of
osmosis and osmotic characteristics of 3 categories of organisms. A 28-item subset of
SCI-A named SCI was created (explained in the next section) consisting of the first 28
items of SCI-A. There are 5000 examinees in the SCI-A and SCI samples.

In addition, in order to examine the effect of sample size on DIMTEST, both SCI and
READ are randomly split into four mutually exclusive data sets. READ is split into
READ1, READ2, READ3, and READ4, with 750, 1000, 1250, and 2000 examinees,
respectively. Similarly, SCI is split into SCI1, SCI2, SCI3, and SCI4, with 750, 1000, 1250,
and 2000 examinees, respectively. In all there are 22 test data sets. These are listed along with
the test size and sample size in the first three columns of Tables 1 and 2.
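
A random split of this kind can be sketched as below. The sketch only illustrates dividing 5000 examinee indices into mutually exclusive subsets of the stated sizes; the function name `split_sample` and the fixed seed are assumptions, and the randomization actually used in the study is not described in the text.

```python
import random


def split_sample(n_examinees, sizes, seed=0):
    """Randomly split examinee indices 0..n-1 into mutually exclusive subsets."""
    if sum(sizes) != n_examinees:
        raise ValueError("subset sizes must add up to the number of examinees")
    order = list(range(n_examinees))
    random.Random(seed).shuffle(order)  # seeded only to make the sketch reproducible
    parts, start = [], 0
    for size in sizes:
        parts.append(order[start:start + size])
        start += size
    return parts


# Sizes taken from the text: 750, 1000, 1250, and 2000 examinees.
parts = split_sample(5000, [750, 1000, 1250, 2000])
```

Shuffling once and slicing guarantees that no examinee appears in more than one subset.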
Creation of Two-Dimensional Test Data

Three  different  sets  of  two-dimensional  test  data  from  the  content  perspective  were 
created  by  combining  responses  from  test  data  that  were  assessed  as  essentially 
unidimensional  by  DIMTEST  in  the  present  study. 

The  two-dimensional  test  data,  RS,  was  created  by  combining  responses  of  30  items 
of  READ  with  the  responses  of  6  items  of  SCI  forming  a  36-item  test  with  5000  examinees. 
The  6  items  of  SCI  are  part  of  one  of  the  passages  randomly  selected  from  its  5  passages. 
Just as in the unidimensional case of READ and SCI, RS is then randomly split into 4
mutually exclusive data sets, RS1, RS2, RS3, and RS4, with 750, 1000, 1250, and 2000
examinees, respectively. These tests are listed along with their test sizes and sample sizes
in the first four columns of Table 3.

The two-dimensional test data ARGS1, for grade 10, was created by combining the
responses of 30 items from AR10 with the responses of 5 items (randomly selected from 25
item responses) from GS10. Similarly, ARGS2 was created by combining the responses of
30 items from AR10 with the responses of 10 items from GS10. The two-dimensional test
data GSAR1, for grade 12, was created by combining the responses of 25 items from GS12
with the responses of 5 items from AR12; and GSAR2 was created by combining the
responses of 25 items from GS12 with responses of 10 items from AR12. These test data are
listed along with their test sizes and sample sizes in the first four columns of Table 4.

The two-dimensional test data HSTLIT1 was created by combining the responses of
31 items from HIST with the responses of 5 items (randomly selected from 30 item
responses) from LIT. Similarly, HSTLIT2 and HSTLIT3 were created by combining the
responses of 31 items from HIST with the responses of 8 and 10 items, respectively, randomly
selected from LIT. These test data are listed along with their test sizes and sample
sizes in the first four columns of Table 5.
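
The construction of a combined test can be sketched as a column-wise merge of two response matrices. The sketch assumes that row j of both matrices belongs to the same examinee, which the text does not spell out; the function name `combine_tests` and the toy matrices are hypothetical.

```python
import random


def combine_tests(base, other, n_other_items, seed=0):
    """Append randomly chosen item columns of `other` to every row of `base`.

    base, other: lists of 0/1 response vectors with one row per examinee,
                 assumed to be matched row by row (an assumption of this sketch).
    Returns the combined matrix and the chosen column indices of `other`.
    """
    if len(base) != len(other):
        raise ValueError("both tests need one row per (matched) examinee")
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(other[0])), n_other_items))
    combined = [row_b + [row_o[i] for i in chosen]
                for row_b, row_o in zip(base, other)]
    return combined, chosen


# Toy example: 3 examinees, a 4-item test plus 2 items drawn from a 5-item test.
base = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
other = [[1, 1, 0, 0, 1], [0, 0, 1, 1, 0], [1, 0, 1, 0, 1]]
combined, chosen = combine_tests(base, other, n_other_items=2)
```

Returning the chosen column indices makes it possible to force exactly those items into the assessment subtest later, as is done with expert-opinion selection.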
Results


Unidimensional Studies

All the tests in Table 1, except HIST, READ, and SCI (which are derived subtests of
HIST-A, READ-A, and SCI-A, respectively, as described below), were initially tested for
essential unidimensionality using DIMTEST. In each case, 500 examinees were randomly
selected from the given pool for selecting AT1 items using factor analysis. The
rest of the examinees were used for computing Stout's statistic T. The size of AT1 (M) was also
determined by DIMTEST. For each test, the T-value and the p-value are noted. Table 1
lists the T- and p-values for all tests in the fourth and fifth columns. The method of
selection of the AT1 subtest, the value of M, and item numbers selected for AT1 are listed
in the last three columns of Table 1.


Table  1  about  here 


It can be seen from Table 1 that the p-values associated with test data LIT, AR10,
AR12, GS10, and GS12 are well above the nominal level of significance ($\alpha = .05$), thereby
strongly affirming the essentially unidimensional nature of these tests. That is, the underlying
model generating the test data is judged essentially unidimensional. However, the p-values
associated with HIST-A, AS10, AS12, MATH, READ-A, and SCI-A are well below the
nominal level of significance of .05, thereby strongly affirming the multidimensional nature
of  these  test  data.  For  these  tests  where  p-values  were  below  the  nominal  level,  the  nature 
of  multidimensionality  was  further  explored. 
When the test data are essentially unidimensional, items of AT1 are, by logic, of the
same dominant dimension as the rest of the items; therefore, DIMTEST does not reject the
null hypothesis. When the test data are not unidimensional, however, the items of AT1 are
dimensionally different from the rest of the items, and DIMTEST rejects the null
hypothesis of essential unidimensionality. Following this reasoning for tests where p-values
were very low, the content of the items of AT1 was examined. Table 1 shows that for
HIST-A, items 12 through 16 and item 6 were selected for AT1. Upon studying the content
of these items, it was found that items 12 through 16 were homogeneous and differed
dimensionally from the rest of the items of HIST-A; these 5 items require knowledge of the
location of different countries on the world map (map items), while the rest of the items
deal with U.S. history. It is also possible in theory that these items were selected for AT1
due to chance alone. In order to test for this, DIMTEST was applied to the given sample of
2428 examinees 100 times, each time randomly splitting the 2428 examinees into
two groups of 500 and 1928 examinees. That is, AT1 items were selected repeatedly on
different random samples of 500 examinees each. The resampling results showed that items
12 through 16 were consistently selected for AT1. In addition to these items, one or two
more items, which varied from run to run, were selected from the rest of the items. Hence
it was concluded that the map items are dimensionally different from the rest. A subset
HIST was formed consisting of all items of HIST-A except for the map items. It can be seen
from Table 1 that the p-value associated with HIST (p=.095) shows evidence of essential
unidimensionality. Furthermore, from the content perspective, items of AT1 do not form a
set that is dimensionally different from the rest of the items of HIST.
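
The resampling check described above can be sketched as a loop that repeatedly draws 500 examinees and records which items are selected. The factor-analytic selection used by DIMTEST is not reproduced here: `select_at1` is a hypothetical callback standing in for it, and `toy_selector` is a made-up stand-in that merely mimics the reported pattern of five stable items plus one varying item.

```python
import random
from collections import Counter


def selection_frequency(n_examinees, select_at1, n_reps=100, fa_size=500, seed=0):
    """Count how often each item is selected over repeated random subsamples.

    select_at1: callable taking a list of examinee indices and returning the
                item numbers chosen for the assessment subtest (hypothetical).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reps):
        fa_sample = rng.sample(range(n_examinees), fa_size)
        counts.update(set(select_at1(fa_sample)))
    return counts


def toy_selector(fa_sample):
    """Stand-in selector: always items 12-16 plus one sample-dependent item from 1-11."""
    extra = 1 + (min(fa_sample) % 11)
    return [12, 13, 14, 15, 16, extra]


freq = selection_frequency(2428, toy_selector)
```

Items selected in nearly every replication are unlikely to have been chosen by chance, which is the reasoning applied to the map items.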

A similar phenomenon was observed with test data READ-A and SCI-A. For
READ-A, the last 10 items (the items following the last passage) formed part of subtest
AT1. Again, these same 10 items formed part of AT1 in repeated resampling applications of
DIMTEST. Upon studying the content of these items, it was found that these 10 items
tapped the "psychology" content area, which is different from the "literature" area tapped by the
first three passages. Another possibility is that, since these are the last 10 items of the reading
test, speededness could have caused the secondary dimension. Based on these observations,
it was concluded that these items were dimensionally different from the rest, and a subset
READ was formed consisting of the first 30 items of READ-A. It can be seen from Table 1
that the p-value associated with READ (p=.32) shows strong evidence of an essential
unidimensional model underlying the test items. In addition, items of AT1 now come from
all the passages of READ.

For test data SCI-A, the 12 items following the last two passages formed part of
AT1. Just as in HIST-A and READ-A, after resampling application of DIMTEST, these
items  were  removed.  The  resulting  subtest  SCI  with  the  first  28  items  was  still  found  to  be 
multidimensional  (p=.002).  Thus,  a  unidimensional  subset  could  not  be  formed.  Unlike 
reading  test  items,  science  test  items  come  from  distinctly  different  content  areas,  with  a 
moderate  correlation  among  content  areas,  and  require  a  higher  level  of  abstract  reasoning 
and  analytical  skills  than  the  reading  items.  Thus,  in  addition  to  content  areas,  difficulty 
or  speededness  could  have  caused  major  secondary  dimensions  in  this  case. 

For the test data MATH, AS10, and AS12, where p-values were low, items of AT1
did not form a subgroup tapping a secondary ability as found in HIST-A, READ-A, or
SCI-A. In addition, upon studying the content of the items, it was found that these items tap
multiple major content areas. Therefore, these test data are treated as multidimensional.


Table  2  about  here 


Table 2 shows dimensionality results of the unidimensional READ and
multidimensional SCI test data for different sample sizes. The p-values associated with
READ1 through READ4 show evidence of a high degree of essential unidimensionality
underlying the test data. These results are consistent with those of READ in Table 1. The
items selected for AT1 vary widely across tests READ1 through READ4, and yet the results
consistently affirm essential unidimensionality. The results of SCI1 through SCI4 are
consistent with those of SCI in Table 1 in affirming multidimensionality of the test data.
Items of AT1 varied widely across all four tests, and yet the results consistently affirmed
multidimensionality, except for SCI3.


Two-Dimensional Studies


Results  of  two-dimensional  reading  and  science  test  data  are  reported  in  Table  3. 
Since  items  that  tap  a  distinct  second  dimension,  from  the  content  perspective,  are  clearly 
known (in this case, 6 SCI items), the science items were forced to be selected for AT1.
This is an example where expert opinion is used to select AT1 items. The T- and p-values
for RS1, RS2, RS3, RS4, and RS strongly confirm the two-dimensional nature of these test
data. As expected, as the sample size increases, the power also increases.


Table  3,  Table  4  and  Table  5  about  here 


The results of the two-dimensional test data of ARGS and GSAR are reported in
Table 4. In this case also, since the items that are used to create these two-dimensional data
are known (GS items for ARGS and AR items for GSAR), these items were forced to be
selected for AT1. The T- and p-values associated with all four tests strongly confirm
the multidimensionality of these test data. For ARGS1 and ARGS2, there is a sharp
increase in the T-value (with a correspondingly lower p-value) as the degree of contamination,
as measured by the number of item responses contaminated, increases from 5 to 10.

The  results  of  the  two-dimensional  history  and  literature  test  data  are  reported  in 
Table 5. As with other two-dimensional tests, LIT items were forced to be selected for
AT1. In this case also, the T- and p-values confirm the multidimensional nature of these
data. 

DIMTEST was again applied to a sample of test data selected from the two-dimensional
tests. This time FA was used as the method of selection for AT1 items. The purpose of this
analysis was to check whether the FA method of selecting AT1 items would lead to
p-values similar to those obtained with EO. The findings revealed that for these tests FA could
not always ferret out purely unidimensional items from the content perspective. The subtest AT1
had a mixture of items tapping both dimensions, and DIMTEST was then able to correctly assess
dimensionality only when there were 1000 or more examinees for computing the statistic.

Discussion  and  Conclusions 

None  of  the  tests  examined  in  the  present  study  are  strictly  unidimensional  in  the 
sense  of  measuring  only  one  ability.  Items,  in  every  test,  are  influenced  by  several 
secondary  abilities  in  addition  to  the  major  ability  intended  to  be  measured.  Based  on 
DIMTEST  analysis,  some  test  data  were  assessed  as  fitting  an  essential  unidimensional 
model  while  others  were  not.  This  depends  upon  whether  the  secondary  abilities  were  major 
or  minor. 

The unidimensionality analyses of HIST-A, READ-A, and SCI-A present interesting
findings. For HIST-A, the map items had high second-factor loadings and thus were
selected for ATI. Consequently, the computed T-statistic was large, leading to the
rejection of $H_0$ and implying that the ATI items are dimensionally different from the
rest of the test. Content analysis reveals that HIST-A consists of items on United
States history for different time periods, spanning from 1763 to the present. These items
cover such a large span of time that the test is surely slightly multidimensional for this
reason alone. In addition, the test contains map items. The map items, however, were
isolated and statistically confirmed as not measuring the same trait as the rest of the test.
This shows that the statistic T is highly sensitive to distinct major dimensions (in this
case, map items). The analysis of HIST, with map items removed, reveals that it is
essentially unidimensional. Thus the statistic T seems to be robust against relatively minor
correlated abilities influencing test items while being sensitive to major abilities. Likewise,
for the test data READ-A, multidimensionality was caused by items tapping a psychology
topic (scientific) versus literature topics (humanities). Once the psychology item responses
were removed, the remaining item responses could be well modeled by an essential
unidimensional model. In contrast, the multidimensionality in SCI-A was due not only to
distinct major abilities but also, likely, to the speededness of the test, which in itself is a
major determinant. Moreover, an essential unidimensional subtest could not be formed for
SCI-A.

Another interesting feature of these analyses is that although both READ and SCI
are paragraph comprehension type test data, they differ widely in the degree of their
approximation to essential unidimensionality. The READ test data has 3 passages, each
followed by 10 items, all dealing with humanities. Although these passages come from
different sources, the model underlying the item responses approximates an essential
unidimensional model. This is an example where a few secondary abilities (possibly highly
correlated) each influence a large group of items. In contrast, the SCI test data has 5
passages, each followed by 5 or 6 items. These passages, although they deal with science in
general, come from widely different and conceptually difficult topics, and the model
underlying the item responses does not approximate an essential unidimensional model.
This is an example where many secondary abilities each influence a small group of items,
but the strength of the influence of these secondary abilities is such that the item responses
cannot be well modeled by an essential unidimensional model. These results are consistent
with the simulation results of Nandakumar (1991) in that the number of items influenced by
secondary abilities and the strength of the secondary abilities present determine the degree
to which the assumption of essential unidimensionality is violated.

The  results  obtained  in  this  study  are  similar  to  the  results  obtained  by  other 
researchers  who  have  analyzed  some  of  these  data  using  different  statistical  methodologies. 
Zwick  (1987)  performed  dimensionality  analyses  of  HIST-A  and  LIT  by  various  techniques 
to  assess  dimensionality  and  concluded  that  these  are  unidimensional.  Regarding  the  ACT 
data,  it  is  believed  that  MATH  and  SCI  are  multidimensional.  Bock,  Gibbons,  and  Muraki 
(1985)  have  analyzed  ASVAB  test  data  for  a  different  sample  and  found  a  significant 
second factor for arithmetic reasoning, general science, and auto shop information. Since
the sample used here is not the same, it is hard to develop a meaningful comparison.

The results for the two-dimensional tests demonstrate very good power for the statistic
T. The statistic T is able to distinguish minor secondary traits, which should be
largely discounted, from the major dominant traits. This is evidenced in several cases. The
test  data  HIST  illustrates  this.  There  is  inherent  multidimensionality  in  HIST  as  it  covers 
a  range  of  time  periods  in  history.  However,  the  p-value  is  above  the  nominal  level  of 
significance,  suggesting  acceptance  of  unidimensionality.  By  contrast,  with  the  additional 
contamination  of  only  5  LIT  items  or  5  map  items,  the  T-value  shoots  up,  indicating 
essential  multidimensionality  of  the  data.  This  remarkable  sensitivity  of  the  statistic  T  to 
major  dimensions  illustrates  its  power. 

These results, for the first time, have illustrated both the factor analysis approach
and the expert opinion approach to selecting items for the subtest ATI. Tables 1 and 2 use
FA to select ATI items, and Tables 3, 4, and 5 use EO. It is evident that FA serves as an
exploratory tool and EO serves as a confirmatory tool in selecting items for ATI to assess
essential dimensionality.
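
The exploratory (FA) route can be sketched as a simple ranking rule: take the M items with the largest second-factor loadings. This is a simplification under stated assumptions. The loadings are assumed to come from a separate factor analysis, ranking by absolute magnitude is an assumption made here, the values below are invented for illustration, and `select_ati_by_loading` is a hypothetical name rather than the DIMTEST selection routine. Under EO, by contrast, the item positions are simply supplied by the analyst.

```python
def select_ati_by_loading(second_factor_loadings, m):
    """Return 1-based positions of the m items with the largest absolute
    second-factor loadings, in ascending position order."""
    if not 0 < m <= len(second_factor_loadings):
        raise ValueError("m must be between 1 and the number of items")
    # Rank item indices by the magnitude of their second-factor loading.
    ranked = sorted(
        range(len(second_factor_loadings)),
        key=lambda i: abs(second_factor_loadings[i]),
        reverse=True,
    )
    return sorted(i + 1 for i in ranked[:m])


# Invented loadings for a 6-item test (illustration only, not data from this study).
loadings = [0.05, 0.42, -0.10, 0.51, 0.08, 0.47]
ati = select_ati_by_loading(loadings, m=3)  # selects items 2, 4, and 6
```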

The dimensionality of a given set of item responses is, in a certain sense, a
continuum: one cannot determine whether a given set of responses generated by administering
a set of items to an examinee sample is truly essentially unidimensional or truly
multidimensional; one can only approximate. Although the exact number of dimensions in an
IRT model is rigorously defined for a finite length test, the number of dominant dimensions
(whether determined by Stout's essential dimensionality conceptualization or by some other
conceptualization) is only rigorously definable for an infinitely long test. In other words,
for  a  finite  test  (that  is,  for  any  real  test  data)  it  is  a  judgment  call  whether  a  particular 
IRT  model  is  seen  as  having  one,  or  more  than  one,  dominant  dimension,  based  upon  where 
on  the  continuum  the  amount  of  multidimensionality  falls.  One  consequence  of  this  is  that 
the  performance  of  ability  estimation  procedures  such  as  LOGIST  or  BILOG  needs  to  be 
addressed  in  the  context  of  the  assessment  of  the  amount  of  lack  of  unidimensionality.  In 
this  regard,  indices  of  lack  of  essential  unidimensionality  developed  by  Junker  and  Stout 
(1991)  will  be  extremely  useful.  These  indices  can  be  used  to  decide  when  it  is  safe  to  use 
unidimensional  estimation  procedures  such  as  LOGIST  and  BILOG  to  arrive  at  accurate 
estimates  of  ability. 

In cases where the approximation of an essential unidimensional model to the data is in
question, there are various alternatives. The test items can be split into essential
unidimensional subtests (for example, HIST-A and READ-A). Another possible approach
is to investigate the applicability of the concept of "testlet" to the data (Rosenbaum, 1988;
Thissen, Steinberg, and Mooney, 1989). If the assumption of local independence is violated
within the passages but maintained among the passages, the theory of testlets promises
unidimensional scoring for such tests. The test data SCI-A and SCI could fall into this
category. Multidimensional modeling can be applied if neither of the above procedures can
be applied (Reckase, 1989).
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Table  1 

Results of $H_0$: $d_E = 1$, $\alpha = .05$

      No. of   No. of                       Selection of
      items    examinees     T       p      ATI items      M   Items of ATI

 1    30         750         .05    .480    FA             5   11,12,13,15,17
 2    30        1000         .48    .317    FA             7   1,2,6,11,12,13,21
      30        1250        -.06    .524    FA             7   2,4,6,9,11,12,13
      30        2000        1.01    .155    FA             5   1,11,12,13,16
      28         750        1.89    .029    FA             7   1,3,4,5,17,20,21
      28        1000        3.19    .007    FA             6   8,12,14,18,20,24
      28        1250        1.38    .080    FA             7   6,9,10,11,19,25,28
      28        2000        2.91    .001    FA             7   8,9,10,11,12,19,22

Table 3

Results of $H_0$: $d_E = 1$ for two-dimensional tests: READ & SCI; $\alpha = .05$

        No. of items   No. of                          Selection of
Test    READ   SCI     examinees       T       p       ATI items      M   Items of ATI

RS1     30     6       [illegible]    1.92    .020     EO             6   31,32,33,34,35,36
RS2     30     6       [illegible]    2.72    .003     EO             6   31,32,33,34,35,36
RS3     30     6       [illegible]    3.71    .0001    EO             6   31,32,33,34,35,36
RS4     30     6       [illegible]    3.32    .0005    EO             6   31,32,33,34,35,36
RS      30     6       [illegible]    6.83    .0000    EO             6   31,32,33,34,35,36

Table 4

Results of $H_0$: $d_E = 1$ for two-dimensional tests: AR & GS; $\alpha = .05$

         No. of items   No. of                       Selection of
Test     AR     GS      examinees     T       p      ATI items      M    Items of ATI

ARGS1    30      5      1853         2.85    .002    EO              5   31,32,33,34,35
ARGS2    30     10      1853         6.15    .000    EO             10   31,32,33,34,35,36,37,38,39,40
GSAR1    25      5      1811         4.29    .000    EO              5   26,27,28,29,30
GSAR2    25     10      1811         4.06    .000    EO             10   26,27,28,29,30,31,32,33,34,35

Table 5

Results of $H_0$: $d_E = 1$ for two-dimensional tests: HIST & LIT; $\alpha = .05$

           No. of items   No. of                       Selection of
Test       HIST   LIT     examinees     T       p      ATI items      M    Items of ATI

HSTLIT1    31      5      2428         3.01    .036    EO              5   32,33,34,35,36
HSTLIT2    31      8      2428         3.38    .000    EO              8   32,33,34,35,36,37,38,39
HSTLIT3    31     10      2428         2.03    .021    EO             10   32,33,34,35,36,37,38,39,40,41
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